A mixed model approach for joint genetic analysis of 
alternatively spliced transcript isoforms using RNA-Seq data 



Barbara Rakitsch^, Christoph Lippert^, Hande Topa*^, Karsten Borgwardt^''*, Antti Honkela^, and 

Oliver Stegle^ 

^ Max Planck Institutes Tiibingen, Tiibingcn, Germany, 
{rakitsch.borgwardt , steglejOtuebingen.mpg. de 
^ Microsoft Research, Los Angeles, California, USA 
lippertOmicrosof t . com 

^ Helsinki Institute for Information Technology HUT, Department of Information and Computer Science, 

Aalto University, Helsinki, Finland, 
hande . t opaOaalt o . f i 

* Zentrum fuer Bioinformatik, Eberhard Karls Universitat Tiibingen, Tiibingen, Germany 
^ Helsinki Institute for Information Technology HUT, Department of Computer Science, 
University of Helsinki, Helsinki, Finland, 
antti .honkelaOhiit . f i 



Abstract. RNA-Seq technology allows for studying the transcriptional state of the cell at an 
unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible 
to disentangle the abundance of individual alternatively spliced transcript isoforms of a gone. 
A central question is to understand the regulatory processes that lead to differences in relative 
abundance variation due to external and genetic factors. Here, we present a mixed model approach 
that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and {ii) 
mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes 
of variation and correlation between transcript isoforms, including the genomic background and 
technical quantification uncertainty. As a result, our method allows to accurately test for shared 
as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially 
improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 
126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained 
picture of the genetic basis of gene expression variation. 
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1 Introduction 

Motivation Large-scale genotyping and expression profiling initiatives have fostered expression quan- 
titative trait loci (eQTL) analyses, investigating the genetic component of the transcriptional state of 
the cell. In human, cQTL studies have revealed an abundance of genetic associations between proxi- 
mal polymorphic loci and individual genes |27I28I20I18] . In model organisms, such as segregating yeast 
strains |4l24j . mouse [22] or Arabidopsis thaliana |30j . studies have uncovered a map of the genetic 
component of gene regulation, which is characterized by an interplay of local genetic associations in cis 
and distal genetic effects with trans mechanisms. The majority of existing studies of transcription were 
based on microarray technologies, which are limited to a coarse quantification of the total transcript 
abundance of a-priori known genes present in a sample. 

Only recently, thanks to second generation sequencing techniques, deep transcriptome sequencing 
(RNA-Seq) has become viable even for population-scale analyses. Seminal work |18|20l has demon- 
strated the merits of using the digital readout provided by RNA-Seq technologies for genetic analyses, 
allowing for a more comprehensive dissection of the genetic component of transcriptional variability. 
RNA sequencing not only allows for an unbiased genome-wide quantification of transcription at an 
unprecedented resolution, it also enables the detection and quantification of alternative splice forms 
of genes |19|29j . Differences in the abundances of individual isoforms of a gene have been shown to 
play an important role for phenotypes in specific tissues. For example, genetic modifications that affect 
alternative splicing and post-transcriptional quality control mechanisms are postulated to be involved 
in a large proportion of genetic disorders, including the development of cancer [17123191615] . 

By treating transcript isoform abundances as quantitative traits it is possible to perform eQTL 
studies on the level of individual isoforms. Such an approach allows for detecting polymorphic loci 
affecting the expression level of specific isoforms, for example via regulation of splice factors. While there 
have been previous attempts to tackle this problem at the level of individual exons |18I20| , there is a lack 
of statistical models that allow for accurately dissecting the genetics of transcriptional gene-regulation at 
the level of individual transcript isoforms. Such methods face various challenges that need to be addressed 
in order to get meaningful results: First, the expression levels of individual transcript isoforms are highly 
correlated due to common transcriptional regulation as well as shared exons and splicing mechanisms 
between different isoforms. Second, limited identifiability of individual transcript isoforms from RNA- 
seq data causes correlation between isoform abundance estimates, which can confound the analysis if 
not accounted for. Finally, as in almost any genomic analyses of quantitative traits, there is the need to 
account for structure within the sample, for example due to shared ancestry or population clustering. 

Our goal in this article is to present a probabilistic model that allows for detecting (i) genetic loci 
that affect the overall expression level of multiple transcript isoforms and (ii) loci that act in an isoform- 
specific manner, while addressing the aforementioned challenges. 

Related approaches An alternative paradigm to what we present here are approaches that perform 
the analysis on the level of single exons instead of complete isoforms. Such analyses approaches have 
been used in the context of genetic perturbations [18)20j and other factors [2]. While such an approach 
imposes fewer assumptions on the processes that are involved, individual exon-specific regulation needs 
to be retrospectively combined to a transcript-level view. Furthermore, sensitive transcript expression 
quantification can borrow statistical strength across exons of the same transcript, potentially picking 
up more subtle variation (See also discussions in |8ll2j ). 

Conceptually, our approach is related to multi-trait mixed models, which have undergone a long 
history of development (see e.g. |10l26l21j ). Recently, these approaches have been successfully applied 
to genome-wide association studies of correlated traits and have been shown to increase power to detect 
pleiotropic SNP effects on two correlated traits as well as being able to detect environment-specific 
effects of SNPs on a phenotype measured under two different environments |13j . 

Contributions of this paper In this work we propose a mixed model to jointly map individual 
isoforms from alternatively spliced genes and to uncover genomic variants that affect all or specific 
transcript isoforms. By employing a state-of-the-art approach to infer expression levels of individual 
isoforms 8 , we build a problem-specific background covariance that explains confounding genetic and 
technical correlation between transcript isoform abundance estimates of the same gene. We apply the 
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resulting mixed model to search for proximal cis associations in 126 HapMap individual where we 
find extensive regulation at gene level but also genetic effects that alter the expression abundance in a 
transcript-specific manner. 

Besides a fine-grained picture of the genetic basis of gene expression, another key advantage of the 
proposed approach is that it is convenient to accurately measure statistical significance in our model, 
p-values for association can be computed analytically through likelihood ratio tests, without the need 
for expensive permutation experiments to assess statistical significance. By controlling for sources of 
confounding our model docs not suffer from p-value inflation, as it is observed in simpler methods, and 
thus provides better control for the type-1 error rate. 



2 Transcript expression inference from RNA-Seq 

Transcript isoform expression can be only indirectly reconstructed from RNA-Seq datasets. Here, we 
build on BitScq [8 , a fully probabilistic approach to estimate transcript isoform expression levels from 
RNA-Seq data and a known transcript annotation. 

Briefly, the generative model of the RNA-seq reads in each sample assumes a categorial assign- 
ment of reads to one of all genome-wide transcripts or a noise category according to p{Ij' \ 6*") ~ 
Cat(/" I {0", . . . where 0" is the probability of observing a fragment of transcript isoform t in 

sample n, and M denotes the total number of annotated transcript isoforms. The likelihood of a read rj 
given a transcript assignment, p{rj \ Ij = m) = p{rs \ seqmps)p{p \ m)p{s \ m), is given by the probability 
of drawing a read from the specific position in the transcript isoform m while accounting for mismatches, 
position and sequence bias correction. Given an observed set of reads, we can apply Bayesian Markov 
chain Monte Carlo inference to obtain a full posterior probability distribution over the transcript iso- 
form relative expression levels 6*" . These are transformed to log-expression levels — log 6'" for further 
analysis. 

For each gene, we use the Monte Carlo estimates to obtain summary statistics that capture the 
mean expression level and covariation between the abundance estimates y" of the Tg different tran- 
scripts of each gene g within each sample n. Modeling covariation is useful because transcript iso- 
forms often share sequence which causes ambiguity in assigning reads to individual isoforms, and the 
transcript expression posteriors often show strong negative correlation (See also [8]). These transcript 
isoform means and between-isoform expression covariance summaries are then used for further mod- 
eling. This first and second moment of the variation for each gene is explained by a Gaussian over 
y" = {yr, . . . , y?J,p(y | V) = J\f{y \ y, C"). 

Along the same lines, the same Monte Carlo estimates can be use to estimate whole-gene expression, 
ignoring the transcript structure. For a specific gene g in sample n, expression levels can be estimated 
as Ug = log X^teT^"' where T is the set of all transcripts belonging to the gene g. These estimate yield 
RKPM estimates that are near-identical to standard counting-based methods. 



3 A multi-isoform mixed model 



To differentiate joint effects from isoform-specific regulation it is important to account for different 
sources of correlation between multiple transcript isoforms of the same gene. Here, we propose a multi- 
isoform variance component model that comprehensively accounts for different origins of correlation. 
First, multiple transcript isoforms of the same gene are correlated across samples because of genetic 
factors that affect multiple transcripts in the same way. In the absence of transcript-specific splicing 
regulation, this variation is dominated by a common gene expression component. Second, there is tech- 
nical covariation between the isoforms quantified in the same sample (see Section [2]). 

In Section |3.1[ we introduce a linear mixed model that accounts for both of these types of genetic 



and technical factors that induce correlation between transcripts. In Section 3.2 we show how the model 
can be used to test for common and specific genetic effects of a SNP (single nucleotide polymorphism). 



4 Rakitsch et al. 



3.1 Multi-isoform linear mixed model 



For a specific gene (we drop tlie dependence on g to unclutter notation), let ttie A'' • T-dimensional 
vector Y = [y^, • ■ • ^Vt] denote tlie vector of log expression levels for T different isoforms of a single 
gene measured in each of the N samples. For each t e [1, . . . ,T], we model the A^-dimensional vector 
j/( = , . . . , y^) holding all samples of isoform i by a linear model as follows: 

1^ +x^^^+ ^Xi-Pi^t (1) 

Transcript moan SNP effect ^^PQP ^ ^ noise 

population structure 

Here, fit is the bias for transcript t, x* is a SNP we would like to test for association, for example in the 
vicinity of the gene. The SNPs {a;;}igpop are genome- wide markers that capture population structure, 
excluding SNPs within a window of ±1 mega base around the gene to avoid proximal contamination |16j 
when testing SNPs that lie in cis. The genetic cis effect P* is modeled as a fixed effect. The population 
structure effects {/3z,t}igpop are modeled as random. To allow for correlation between the effects of 
population structure across all isoforms, the T-dimensional vector, /3; = . . . , /3;_t] is coupled across 
transcript isoforms by an identical multivariate normal distribution with covariance V''"^, for each SNP 
xi used to represent population structure. 



/3, -AA(0,yP°P); yP°P = 



The T(T + \)/2 independent entries in 1/^°^ are inferred from the data within the model. For genes 
with a large number of transcript isoforms, the number of parameters may become prohibitive, in which 
shrinkage approaches such as Lasso regulation on the inverse covariance [7] could be used; see also 
Discussion. 

The observational noise is independent across individuals, but correlates across isoforms within 
each individual due to technical variation. For every individual n, we let the T-dimensional vector 
■0" = [?/;", . . . , ■4>lf\ be the noise across all isoforms, which is distributed as 

V»"-AA(0,diag(5) + a'C"), (3) 

where 5 = [(5^, . . . , is residual noise level of each isoform, C" denotes the technical correlation 
matrix between the isoforms due to quantification in sample n and is a common scaling parameter 
of the technical covariation. The technical correlation matrices C" are estimated a-priori by the BitSeq 
algorithm (see Section p|. The noise- variances [Sf, . . . ,d^], and are estimated within the model. 



-'l.T 



(2) 



Marginal model Integrating over the effects of population structure (3i^t for all I and t we get a joint 
marginal likelihood for all isoform levels of the gene: 
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and the noise covariance 

SlI " 
6^1 

V ' 

technical noise noise 

where Dij = diag([Cj^^, . . . , Cj^j]). Note that the noise covariance J^"°''''= is a sum of a diagonal matrix 
for observation noise and a chessboard pattern matrix consisting of diagonal matrices that couple dif- 
ferent transcript within individual samples n due to technical noise. For an illustration of the effective 
covariance for a simple two-transcript gene, see Figure [T] Note that in the case of a gene with a single 
transcript, this variance component collapses to a standard linear mixed model with realized relationship 
matrix (see for example |11I15| ). 
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Fig. 1: Illustration of the marginal covariance model for a gene with two transcript isoforms. The to- 
tal covariance of the stacked transcript expression profiles are decomposed into a genetic component, 
technical covariance and transcript-specific observation noise. 

We estimate the variance parameters of the model by maximizing the marginal likelihood Q us- 
ing constrained quasi-Newton optimization. To avoid prohibitive computations due to re-estimation of 
covariance parameters for different cis SNPs x*, we estimate all parameters only once for each gene, 
ignoring the fixed effect of the cis SNP, Analogous approximations, fitting the background covariance 
parameters on the null model, have previously been successfully applied for GWAS [TT]. 
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3.2 Mixed model to test for common and transcript isoform-specific genetic effects 

We would like to test for the effect of individual cis SNPs that either have a common effect on all 
transcript isoforms or act in an isoform-specific manner. For this purpose, we use the parameters obtained 
from the multi-isoform variance components model Eqn. Q in uni-variate testing. We construct a linear 
mixed model using different designs modeling the effect of x* as either common genetic effects across 
transcript isoforms or isoform-specific genetic effects: 
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Here, the joint effect component is the identical SNP x* rephcated over each transcript isoform. From 
this hnear mixed model, different likelihood ratio tests can be carried out for each SNP in cis, using 
the computational tricks proposed in the FaST-LMM algorithm [T5] . In the implementation, we specify 
the relatedness matrix to be the sum of If p°p and K^^°^^° estimated on the null model. This approach 
enables us to obtain p-values from a distribution. Related testing strategies, however focusing on 
pairs of quantitative traits in GWAS, have been previously used in |13j . 

Gene-level association test In order to test for joint effects of SNPs that act on transcription on a 
gene-level, namely x* that jointly regulate all transcript isoforms in the same direction, we fit the mixed 
model in Equation ([s]) only with the joint effect, discarding transcript-specific effects. This alternative 
model is compared to a null model by additionally dropping the joint effect and performing a one degree 
of freedom likelihood ratio test. 

Isoform-specific association test In order to test for specific effects of SNPs, i.e. whether a SNP x* has 
an effect that acts specifically on a transcript isoform t, we fit the full mixed model in Equation ([s]), 
placing the specific effect only on the abundance of this isoform . This alternative model is compared 
to a null model that drops the specific effect. By conditioning on the joint effects we ensure that we only 
retrieve effects that act differently between the transcript isoforms. This results in a likelihood ratio test 
with one degree of freedom. 

Combined association test Finally, we consider a multi-transcript test that assesses whether there is any 
association with a transcript isoform t, either by jointly regulating all isoforms or by acting specifically 
on the isoform t. The alternative model is the same as when testing for isoform-specific tests. However, 
in this case we drop both, the joint effect as well as the specific effect from the null model. In this case 
the likelihood ratio test has two degrees of freedom. 

In the following we will denote these three testing strategies when using the multi transcript co- 
variance structure fitted on the null model (Section |3.1[ ) as multi- isoform mixed model (MIMM). We 
also consider alternative methods for the purpose of comparison, for example without the multi trait 
structure and the technical noise covariance. 

4 cis QTL mapping of transcript isoform levels in HapMap populations 

We applied the multi-isoform mixed model (MIMM) and alternative methods toeinstellbar previously 
published RNA-Seq data of human HapMap samples [T] , combining the data from Pickrell et. al [20] and 
Montgomery et. al [T5^. Altogether, there were 59 samples from the CEU population and 74 samples from 
the YRI HapMap population that could be mapped to unique HapMap identifiers. SNPs were filtered 
for minimum allele frequency 0.05 and we discarded variants with missing genotyping information in 
more than 5% of the samples. Finally, we also discarded all samples with more than 5% missing SNPs. 
After filtering, our dataset comprised of a total of 126 individuals and 996,158 SNPs. 

Transcript quantification was done using BitSeq (Section [2|, where we used the GRCh37.p68 refer- 
ence transcriptome from Ensembl and mapped the RNA-Seq reads with bowtie T^. As the focus of this 
work resides on transcript isoform-specific effects, we only considered genes with at least two transcript 
isoforms. We used the marginal uncertainty estimates from BitSeq (Section [2]) to filter out transcript 
isoforms that were consistently not expressed or difficult to quantify in 90% of the samples or more 
(z-score cutoff 1.5). Because of the small sample size in these datasets, genes with more than 4 active 
isoforms were discarded, ensuring that the comparison of methods was not compromised by sample size. 
This final dataset consisted of 5,954 genes, comprising of a total number of 16,340 individual transcripts. 

Statistical tests for proximal cis effects were carried out in a 1 Mb window upstream or downstream of 
each gene start and stop. In both the MIMM and ordinary mixed models, we considered for comparison, 
we cut out this region for building the multi-isoform relationship matrix K^°^ to avoid possible loss 
of power due to proximal contamination [16j. We also included an indicator variable distinguishing 
between the YRI and CEU population to account for this low-rank component of population structure. 
Estimates of a local false discovery rate for each gene were obtained from the Benjamini & Hochberg [3] 
procedure to estimate q-values. 
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4.1 Testing for general regulatory effects on whole gene expression and transcript 
isoform levels 

First, we wanted to relate results obtained by transcript-based modeling, as proposed here, with associ- 
ations from an eQTL scan on whole gene expression, as considered in previous analyses |18l20j . To this 
end, we applied the multi-isoform mixed model (MIMM) and performed the combined association test to 
find genetic effects that are either specific to individual transcript isoforms or act on the gene-level. We 
compared these results with a linear mixed model (LMM), carrying out standard eQTL tests on whole 
gene expression levels that were also estimates from BitSeq in a manner consistent with the transcript 
level estimates (Section [2]) . Many but not all genes that had at least one significant cis association by 
one of the methods (g-valuc threshold of 5%) were also detected by the other approach (see Figure [2] 
(a)). This result suggests that the combined association test on a transcript level is capable of retriev- 
ing associations that correspond to established eQTLs but also other signals. As expected, both models 
yielded associations with a strong enrichment for lying in close proximity to the gene start (Figures [2] (b) 
and (c)), suggesting that many of these hits are genuine QTLs. We also carried out genome- wide scans 
for a random selection of 49 of genes to investigate calibration of p- values, finding that both methods 
yielded calibrated p- values (genomic control: A = 1.02 MIMM, A ~ 1.03 LMM). Summary results are 
shown in Table [I]). 



O.lOi , 1 0.10 




(a) LMM gene-level Efrects(b) Linear Mixed Model: (c) Multi-isoform Mixed 
(red) vs. MIMM combined as-whole gene expression Model: combined association 

sociation test (blue) test 



Fig. 2: Comparison of significant results (q- value threshold 5%) of the MIMM model, testing for combined 
associations, and a standard linear mixed model applied to whole gene expression levels, (a) Mutual 
overlap of associations, (b-c) Density of the distance to the gene for cis associations. 



4.2 Dissecting genetic associations specific to individual transcript isoforms 

Next, we strived to dissect the set of associations retrieved by the combined association test into as- 
sociations that are either common to all transcripts, i.e. occur at a gene-level, and transcript-specific 
genetic effects. To this end, we used the MIMM and additionally carried out gene-level association tests 
and isoform-specific association tests in the same local cis regions as before. The overlap of results 
obtained from the three alternative testing strategies are shown in Figure [3j The combined association 
test retrieved the greatest number of genes with at least one association, overlapping with many of the 
hits from either the gene-level test or the transcript level test. Notably, genes with an associations found 
by the gene-level test and the transcript-specific test were almost completely disjoint, suggesting that 
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I # significant genes | # significant isoforms | # significant SNPs | Genomic Control (A) 
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Table 1: Summary statistics of the associations retrieved by alternative methods on different expression 
traits. Top row: Gene expression test, carried out on whole-gene expression estimates ignoring transcript 
structure. Bottom row: Transcript expression tests, carried out on all transcripts within a gene jointly 
using alternative testing strategies. 




Fig. 3: Overlap of associations retrieved by the multi-isoform mixed model when considering alternative 
testing strategies. 



within individual genes there is a strong tendency for either gene-level regulation or transcript-level 
regulation. 

Perhaps the most relevant use of multi transcript-isoform models is the association analysis of 
isoform-specific regulatory effects. Figure |4] (f) shows the distance of the 707 isoform-specific associ- 
ations relative to the transcript start site, showing a similar enrichment as observed for gene-level 
associations (Section 4.1 ). Again, we considered a selection of transcripts and carried out genome- wide 
scans to investigate calibration of p- values (Figure [4] (c)). This apparent calibration of tests (A = 1.00) 
when using the MIMM is crucially linked to the fitted multi trait covariance structure, accounting for 
transcript correlation and different sources of noise (Section 3.1). To explore the impact of this mod- 
eling, we also considered simpler alternative covariance structure, either as used in a standard linear 
mixed model (LMM) or a linear model without any background covariance (LIN). At a g- value cutoff of 
5%, the linear model and the linear mixed model retrieved a greater number of genes with at least one 
transcript-specific association (Table [T]). However, these absolute numbers need to be put in context with 
the statistical calibration of different methods: Figure |4^-c show genome-wide QQ-plots for a random 
selection of 121 transcripts (49 genes) tested for association by each of the considered methods. The 
excess of small p-values retrieved by the linear model (A — 1.12) and the linear mixed model (A = 1.73) 
suggests that the large number of findings is due to severely inflated the test statistics and hence the 
majority of these hits are likely to be false positives rather than actual true associations. This hypothesis 
is further supported by the lack of enrichment of associations in proximity to the transcription start 
site (Figure |4]l-f, where only the multi-isoform mixed model yielded hits consistent with the hypothesis 
that true regulatory variants are likely to be within the gene body. Similar deficiencies were also found 
when testing for joint genetic effects (See Figure 5). 

We also investigated the distribution of the number of isoform-specific associations per gene. The 
majority of genes had either a single transcripts that was under cis genetic regulation (38.00%) or had 
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two regulatory events (59.67%). Only a small minority of genes had more complex regulation involving 
more than two transcripts (2.33%). 

Two example genes, one with a joint genetic effect and a transcript-specific effect are shown in 
Figure|6] This illustrations demonstrates how different testing approaches provide a finer-grained picture 
of the genetic regulation of gene expression, both at the gene- and at the transcript level. 




(a) QQ-pIot Linear Model (b) QQ-plot Linear Mixed Model (c) QQ-plot Multi-isoform Mixed 

Model 
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(d) Association distance: Linear(e) Association distance: Linear(f) Association distance: Multi- 
Model Mixed Model isoform Mixed Model 

Fig. 4: Comparison of alternative methods to detect transcript-specific associations in proximal cis re- 
gions. (a)-(c): Quantile-quantilc plots for genome-wide scans of a selection of 121 traits using alternative 
methods, (d)-(f): Histogram of the distance of the most associated variant to the transcription start 
cite for significant hits (FDR 5%) . The multi-isoform mixed model achieved the best calibration of test 
statistics and retrieved cis associations with an enrichment near the transcription start site. 



5 Conclusions 

Transcriptional and post-transcriptional regulation are complex biological processes, both of which are 
under genetic control. Here, we have introduced a mixed model approach to test for genetic effects that 
either act on all transcripts jointly or alter expression levels of specific transcript isoforms while others 
remain unaltered. 

In a proof of concept application to RNA-Seq profiles from 126 HapMap samples, we have demon- 
strated substantial benefits of this approach compared to established analysis strategies. First, we have 
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(d) Association distance: Linear(e) Association distance: Linear(f) Association distance: Multi- 
Model Mixed Model isoform Mixed Model 

Fig. 5: Comparison of alternative methods to detect joint associations in proximal cis regions, (a)-(c): 
Quantile-quantile plots for genome-wide scans of a selection of 121 traits using alternative methods. 
(d)-(f): Histogram of the distance of the most associated variant to the transcription start cite for 
significant hits (FDR 5%). The linear mixed model and the multi-isoform mixed model showed improved 
calibration over the linear model, while the multi-isoform mixed model detected most cis associations 
near the transcription start site. 

shown that accounting for covariation between transcripts within the same gene, both due to genetic 
and technical factors, greatly improves statistical calibration of p- values for different types of association 
tests. The mixed model covariance structure we propose is tailored to to the analysis of estimates of 
multiple transcript isoform abundances. While general multi-trait mixed modeling has a long history 
for joint modeling of multiple traits (e.g. |10i26»21il3j ). we are not aware of any practical application to 
larger sets than pairs of traits. 

Second, we have shown how different testing strategies on the level of transcript isoforms yield more 
detailed mechanistic insights compared to existing approaches that operate on a gene level, while not 
loosing power. Using statistical tests that correspond to specific regulatory hypotheses, we were able to 
dissect this catalog of general cis associations into common genetic effects operating across transcripts 
and genetic effects that act in a transcript- specific manner. While common regulatory effects are most 
frequent, we have found evidence for transcript-specific regulation in 7.21% of genes; in some instances 
both types of effect were even found within the same gene. 

In conclusion, we have shown how the combination of multi-trait mixed models and probabilistic 
transcript quantification is able to uncover novel biological insights while providing statistically robust 
estimates. Currently available RNA-Seq datasets are limited, both in sample sizes and in terms of 
read lengths for reliable isoform quantification. Because of these limitations, we focused on medium- 
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(c) Combined association test 



(d) Combined association test 
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Fig. 6: Left side: Gene rphSal (ENSG00000181031) has an joint association for all three transcripts 
ENST00000331302(blue), ENST00000323434(green), ENST00000572965(red). Right side: Gene alpl 
(ENSG00000162551) has two distinct transcript-specific association for the transcript ENST00000540617 
(cyan). The first one is shared with the transcript ENST00000425315 (red), while the second is not. The 
other two transcripts show no significant association. 



complexity genes with 2-4 transcripts. More complex transcript structures may be fit by employing 
shrinkage priors, regularizing the effective number of parameters in the model |7I25| . In the future, 
larger datasets of better quality will render these tasks easier and hence models as the one proposed 
here will gain wide-spread applicability to many of the RNA-Seq eQTL studies to come. 
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